average_precision_score (Average Precision, AP)#

Average Precision (AP) summarizes a precision–recall (PR) curve into a single number. It is a ranking metric: it cares about how well you order examples by “how positive” they are.

When to reach for AP: imbalanced binary classification (fraud, rare disease, anomaly detection), information retrieval, recommender ranking.

Goals#

  • Build intuition for precision/recall and the PR curve

  • Derive AP and compute it by hand on a tiny example

  • Implement average_precision_score from scratch (NumPy)

  • Visualize what AP measures (Plotly)

  • Use AP for model selection / early stopping in a simple from-scratch logistic regression

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score as sk_average_precision_score
from sklearn.metrics import precision_recall_curve as sk_precision_recall_curve

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Why AP? (especially when positives are rare)#

In many real problems, the positive class is rare:

  • fraud vs non-fraud

  • diseased vs healthy

  • defective vs non-defective

If only 1% of examples are positive, a model that predicts “negative” for everyone gets 99% accuracy — yet it is useless.
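A quick sketch of that failure mode on synthetic labels (prevalence and sample size chosen arbitrarily): the "always negative" classifier has near-perfect accuracy, while an uninformative scorer's AP collapses to the prevalence.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)  # ~1% positives (synthetic)

# "Predict negative for everyone": accuracy looks great on imbalanced data
accuracy = (y == 0).mean()

# Constant scores carry no ranking information; AP degenerates to the prevalence
ap_const = average_precision_score(y, np.zeros_like(y, dtype=float))

print(f"accuracy={accuracy:.3f}  AP={ap_const:.4f}  prevalence={y.mean():.4f}")
```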

A PR curve focuses on the positive class:

  • Precision answers: “Of the items I flagged, how many are truly positive?”

  • Recall answers: “Of all true positives, how many did I find?”

AP compresses the entire PR curve into one number (higher is better). The “no-skill” baseline is the positive prevalence \(\pi = \frac{\#\text{positives}}{n}\).

2) Precision/recall at a threshold#

Assume binary labels \(y_i \in \{0,1\}\) and model scores \(s_i \in \mathbb{R}\) (larger means “more positive”).

For a threshold \(\tau\), predict:

\[\hat{y}_i(\tau) = \mathbb{1}[s_i \ge \tau]\]

Define counts:

  • \(\mathrm{TP}(\tau)\): predicted 1 and actually 1

  • \(\mathrm{FP}(\tau)\): predicted 1 but actually 0

  • \(\mathrm{FN}(\tau)\): predicted 0 but actually 1

Then:

\[\mathrm{Precision}(\tau) = \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau)+\mathrm{FP}(\tau)}, \qquad \mathrm{Recall}(\tau) = \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau)+\mathrm{FN}(\tau)}\]

Sweeping \(\tau\) from high to low traces out the PR curve.
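As a minimal sketch of these definitions (labels, scores, and the threshold value are arbitrary illustrations), counting TP/FP/FN at a single \(\tau\):

```python
import numpy as np

y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1])
s = np.array([0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.40, 0.30])

tau = 0.7  # one arbitrary threshold
y_hat = (s >= tau).astype(int)

tp = np.sum((y_hat == 1) & (y_true == 1))
fp = np.sum((y_hat == 1) & (y_true == 0))
fn = np.sum((y_hat == 0) & (y_true == 1))

precision = tp / (tp + fp)  # 2 / 4 = 0.5
recall = tp / (tp + fn)     # 2 / 4 = 0.5
print(precision, recall)
```

Lowering \(\tau\) flags more items, which can only raise TP and FP, so recall never decreases while precision may move either way.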

Ranking view (top-k)#

Instead of thinking in thresholds, you can sort examples by score (descending) and take the top-\(k\) as “predicted positive”. Let \(P@k\) be the precision among the top-\(k\) items, and \(R@k\) be the recall among the top-\(k\) items.

As \(k\) grows, recall is non-decreasing, and we trace the PR curve point-by-point.

# Tiny ranked example (8 items)
y_true_toy = np.array([1, 0, 1, 0, 1, 0, 0, 1])
y_score_toy = np.array([0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.40, 0.30])

order = np.argsort(-y_score_toy, kind="mergesort")
y_sorted = y_true_toy[order]
s_sorted = y_score_toy[order]

tp = np.cumsum(y_sorted)
fp = np.cumsum(1 - y_sorted)

precision_at_k = tp / (tp + fp)
recall_at_k = tp / tp[-1]  # tp[-1] == number of positives

# AP = mean precision at every rank where we encounter a true positive
ap_toy = precision_at_k[y_sorted == 1].mean()
prevalence_toy = y_true_toy.mean()

df_toy = pd.DataFrame(
    {
        "rank": np.arange(1, len(y_true_toy) + 1),
        "score": s_sorted,
        "y_true": y_sorted,
        "TP_cum": tp,
        "FP_cum": fp,
        "precision@k": precision_at_k,
        "recall@k": recall_at_k,
        "contributes_to_AP": (y_sorted == 1),
    }
)

df_toy, ap_toy, prevalence_toy
(   rank  score  y_true  TP_cum  FP_cum  precision@k  recall@k  \
 0     1   0.90       1       1       0     1.000000      0.25   
 1     2   0.85       0       1       1     0.500000      0.25   
 2     3   0.80       1       2       1     0.666667      0.50   
 3     4   0.70       0       2       2     0.500000      0.50   
 4     5   0.65       1       3       2     0.600000      0.75   
 5     6   0.60       0       3       3     0.500000      0.75   
 6     7   0.40       0       3       4     0.428571      0.75   
 7     8   0.30       1       4       4     0.500000      1.00   
 
    contributes_to_AP  
 0               True  
 1              False  
 2               True  
 3              False  
 4               True  
 5              False  
 6              False  
 7               True  ,
 0.6916666666666667,
 0.5)

AP as “average precision at each hit”#

In a ranked list, AP has a convenient interpretation:

\[\mathrm{AP} = \frac{1}{m} \sum_{k: y_{(k)}=1} \mathrm{Precision}@k\]

where \(y_{(k)}\) is the label at rank \(k\), and \(m\) is the number of positives.

So every time you “hit” a true positive in the ranking, you record the current precision; AP is the average of those recorded precisions.

rank = np.arange(1, len(y_sorted) + 1)
pos_mask = y_sorted == 1

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=rank,
        y=precision_at_k,
        mode="lines+markers",
        name="precision@k",
    )
)

fig.add_trace(
    go.Scatter(
        x=rank[pos_mask],
        y=precision_at_k[pos_mask],
        mode="markers",
        name="positions where y=1",
        marker=dict(size=10, color="crimson"),
    )
)

fig.add_hline(
    y=ap_toy,
    line_dash="dash",
    line_color="crimson",
    annotation_text=f"AP = {ap_toy:.3f}",
)

fig.update_layout(
    title="Ranked list view: precision@k (AP is the mean at positive hits)",
    xaxis_title="Rank k (top-k predicted positives)",
    yaxis_title="Precision@k",
    yaxis=dict(range=[0, 1.05]),
)

fig.show()
# Build a step PR curve from the ranked list
precision_curve = np.r_[1.0, precision_at_k]
recall_curve = np.r_[0.0, recall_at_k]

# Step-wise area under PR curve (this equals AP)
ap_area = np.sum(np.diff(recall_curve) * precision_curve[1:])

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=recall_curve,
        y=precision_curve,
        mode="lines",
        line_shape="hv",
        fill="tozeroy",
        name="PR curve (step)",
    )
)

fig.add_trace(
    go.Scatter(
        x=recall_curve[1:][pos_mask],
        y=precision_curve[1:][pos_mask],
        mode="markers",
        marker=dict(size=10, color="crimson"),
        name="points where recall increases (true positives)",
    )
)

fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[prevalence_toy, prevalence_toy],
        mode="lines",
        line=dict(dash="dash", color="gray"),
        name=f"baseline prevalence = {prevalence_toy:.3f}",
    )
)

fig.update_layout(
    title=f"Precision–Recall curve (toy example) — AP = {ap_toy:.3f} (area = {ap_area:.3f})",
    xaxis_title="Recall",
    yaxis_title="Precision",
    xaxis=dict(range=[0, 1.02]),
    yaxis=dict(range=[0, 1.02]),
)

fig.show()
ap_toy, ap_area
(0.6916666666666667, 0.6916666666666667)

3) Average Precision (formal definition)#

A PR curve is a set of points \((R(\tau), P(\tau))\) as we sweep the threshold \(\tau\).

In scikit-learn, average precision is computed as the area under the PR curve using a step function:

\[\mathrm{AP} = \sum_{n} (R_n - R_{n-1})\, P_n,\]

where the index \(n\) runs over the curve points in increasing recall.

This is equivalent to the ranked-list formula shown earlier (mean of \(\mathrm{Precision}@k\) at every true positive rank).

Two practical notes:

  • AP only depends on the ordering of scores. Any strictly increasing transform of scores (e.g., \(s \mapsto 10s + 3\)) leaves AP unchanged.

  • AP is not the same as trapezoidal “AUPRC”. If you need a specific integration convention, be explicit.
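To see that the two conventions can differ, here is a quick comparison on random data (illustrative only): `sklearn.metrics.auc` applies the trapezoidal rule to the same curve points, while `average_precision_score` uses the step-function sum above.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, auc

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
s = rng.normal(size=200)

prec, rec, _ = precision_recall_curve(y, s)

ap_step = average_precision_score(y, s)  # step-function convention (AP)
auprc_trap = auc(rec, prec)              # trapezoidal interpolation

print(f"step AP = {ap_step:.4f}, trapezoidal AUPRC = {auprc_trap:.4f}")
```

The trapezoidal version linearly interpolates between PR points, which is known to be optimistic on PR curves; when reporting numbers, state which convention you used.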

def precision_recall_curve_np(y_true, y_score):
    """Simple PR curve from scores.

    Returns precision and recall in *increasing recall* order, with a starting point (recall=0, precision=1).
    """
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score).astype(float)

    if y_true.ndim != 1 or y_score.ndim != 1 or y_true.shape[0] != y_score.shape[0]:
        raise ValueError("y_true and y_score must be 1D arrays of the same length")

    order = np.argsort(-y_score, kind="mergesort")
    y_sorted = y_true[order]
    s_sorted = y_score[order]

    n_pos = int(y_sorted.sum())
    if n_pos == 0:
        return np.array([1.0]), np.array([0.0]), np.array([])

    tp = np.cumsum(y_sorted)
    fp = np.cumsum(1 - y_sorted)

    precision = tp / (tp + fp)
    recall = tp / n_pos

    precision = np.r_[1.0, precision]
    recall = np.r_[0.0, recall]

    # Thresholds corresponding to each added item (top-k): predict positive if score >= threshold
    thresholds = s_sorted
    return precision, recall, thresholds


def average_precision_score_np(y_true, y_score):
    """Average precision (AP) from scratch.

    Equivalent to the mean of precision@k over ranks k where y_true=1 after sorting by score (descending).
    """
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score).astype(float)

    if y_true.ndim != 1 or y_score.ndim != 1 or y_true.shape[0] != y_score.shape[0]:
        raise ValueError("y_true and y_score must be 1D arrays of the same length")

    order = np.argsort(-y_score, kind="mergesort")
    y_sorted = y_true[order]

    n_pos = int(y_sorted.sum())
    if n_pos == 0:
        return 0.0

    tp = np.cumsum(y_sorted)
    precision_at_k = tp / (np.arange(len(y_sorted)) + 1)
    ap = precision_at_k[y_sorted == 1].sum() / n_pos
    return float(ap)


# Quick sanity checks vs scikit-learn
y_check = rng.integers(0, 2, size=50)
s_check = rng.normal(size=50)

ap_np = average_precision_score_np(y_check, s_check)
ap_sk = sk_average_precision_score(y_check, s_check)

# AP is invariant to strictly increasing transforms of the scores
ap_np_affine = average_precision_score_np(y_check, 10 * s_check + 3)
ap_np_cubic = average_precision_score_np(y_check, s_check**3)

ap_np, ap_sk, ap_np_affine, ap_np_cubic
(0.6068815234220586,
 0.6068815234220585,
 0.6068815234220586,
 0.6068815234220586)

4) Visual intuition: random vs informative vs (almost) perfect scores#

Let’s keep the labels fixed and only change the quality of the scores. A random scoring function should have AP close to the prevalence \(\pi\). As the ranking improves, the PR curve bends upward and AP increases.

n = 3000
prevalence = 0.05
y_sim = (rng.random(n) < prevalence).astype(int)

scores_random = rng.normal(size=n)
scores_some_signal = 2.0 * y_sim + rng.normal(scale=1.0, size=n)
scores_almost_perfect = y_sim + 0.001 * rng.normal(size=n)

models = {
    "random": scores_random,
    "some signal": scores_some_signal,
    "almost perfect": scores_almost_perfect,
}

fig = go.Figure()

for name, scores in models.items():
    prec, rec, _ = precision_recall_curve_np(y_sim, scores)
    ap = average_precision_score_np(y_sim, scores)
    fig.add_trace(
        go.Scatter(
            x=rec,
            y=prec,
            mode="lines",
            line_shape="hv",
            name=f"{name} (AP={ap:.3f})",
        )
    )

fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[prevalence, prevalence],
        mode="lines",
        line=dict(dash="dash", color="gray"),
        name=f"baseline prevalence={prevalence:.3f}",
    )
)

fig.update_layout(
    title="PR curves for different score qualities (same labels)",
    xaxis_title="Recall",
    yaxis_title="Precision",
    xaxis=dict(range=[0, 1.02]),
    yaxis=dict(range=[0, 1.02]),
)

fig.show()

5) Practical usage (sklearn)#

For binary classification you should pass scores, not hard labels:

  • y_true: 0/1 labels

  • y_score: probability for the positive class, or a decision score (any real-valued ranking signal)

In multilabel settings, average_precision_score supports average={"micro","macro","weighted","samples"}. Conceptually it computes AP per label (one-vs-rest) and then combines them.
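A small multilabel sketch on random data (shapes assumed: `(n_samples, n_labels)` for both labels and scores); `average=None` exposes the per-label APs that `"macro"` averages:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(300, 3))    # 3 binary labels, one column per label
S = rng.normal(size=(300, 3)) + 1.5 * Y  # scores with some per-label signal

ap_per_label = average_precision_score(Y, S, average=None)     # one AP per label
ap_macro = average_precision_score(Y, S, average="macro")      # unweighted mean of per-label APs
ap_micro = average_precision_score(Y, S, average="micro")      # pool all (sample, label) pairs first

print(ap_per_label, ap_macro, ap_micro)
```

`"macro"` treats every label equally regardless of prevalence; `"micro"` ranks all (sample, label) pairs in one pool, so frequent labels dominate.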

# scikit-learn comparison (binary)
ap_np = average_precision_score_np(y_sim, scores_some_signal)
ap_sk = sk_average_precision_score(y_sim, scores_some_signal)

prec_sk, rec_sk, thr_sk = sk_precision_recall_curve(y_sim, scores_some_signal)
# sklearn returns recall in decreasing order; reverse for plotting
prec_sk_inc = prec_sk[::-1]
rec_sk_inc = rec_sk[::-1]

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=rec_sk_inc,
        y=prec_sk_inc,
        mode="lines",
        line_shape="hv",
        name="sklearn precision_recall_curve",
    )
)

fig.update_layout(
    title=f"sklearn PR curve (AP={ap_sk:.3f})",
    xaxis_title="Recall",
    yaxis_title="Precision",
    xaxis=dict(range=[0, 1.02]),
    yaxis=dict(range=[0, 1.02]),
)
fig.show()

ap_np, ap_sk
(0.4679788743919764, 0.4679788743919763)

6) Using AP to optimize a simple model (from-scratch logistic regression)#

AP depends on sorting/ranking, so it is not nicely differentiable with respect to model parameters. In practice you usually:

  • train a model with a differentiable surrogate loss (e.g., log-loss),

  • select hyperparameters / stop training using a metric like AP on a validation set.

Below we train logistic regression from scratch with gradient descent, and use validation AP to pick:

  • the best epoch (early stopping)

  • an \(\ell_2\) regularization strength

# Synthetic imbalanced classification data
X, y = make_classification(
    n_samples=5000,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_clusters_per_class=2,
    weights=[0.95, 0.05],
    class_sep=1.2,
    random_state=42,
)

X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

# Standardize features using train statistics
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
sigma = np.where(sigma == 0, 1.0, sigma)

X_train_s = (X_train - mu) / sigma
X_val_s = (X_val - mu) / sigma
X_test_s = (X_test - mu) / sigma

y_train = y_train.astype(int)
y_val = y_val.astype(int)
y_test = y_test.astype(int)

(y_train.mean(), y_val.mean(), y_test.mean(), X_train_s.shape)
(0.05366666666666667, 0.054, 0.053, (3000, 6))
def sigmoid(z):
    """Numerically stable sigmoid."""
    z = np.asarray(z)
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


def predict_proba_logreg(X, w, b):
    return sigmoid(X @ w + b)


def log_loss(y_true, y_prob, l2=0.0, w=None, eps=1e-15):
    y_true = np.asarray(y_true).astype(float)
    y_prob = np.clip(np.asarray(y_prob).astype(float), eps, 1 - eps)
    loss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    if l2 and w is not None:
        loss = loss + 0.5 * l2 * float(np.sum(w * w))
    return float(loss)


def fit_logreg_gd(
    X_train,
    y_train,
    X_val,
    y_val,
    lr=0.1,
    l2=0.0,
    n_epochs=800,
    eval_every=10,
):
    n_samples, n_features = X_train.shape
    w = np.zeros(n_features)
    b = 0.0

    best = {"ap": -np.inf, "epoch": 0, "w": w.copy(), "b": b}
    history = {"epoch": [], "train_loss": [], "val_ap": []}

    for epoch in range(1, n_epochs + 1):
        p = predict_proba_logreg(X_train, w, b)
        err = p - y_train

        grad_w = (X_train.T @ err) / n_samples + l2 * w
        grad_b = float(np.mean(err))

        w -= lr * grad_w
        b -= lr * grad_b

        if epoch == 1 or epoch % eval_every == 0:
            train_loss = log_loss(y_train, predict_proba_logreg(X_train, w, b), l2=l2, w=w)
            val_scores = predict_proba_logreg(X_val, w, b)
            val_ap = average_precision_score_np(y_val, val_scores)

            history["epoch"].append(epoch)
            history["train_loss"].append(train_loss)
            history["val_ap"].append(val_ap)

            if val_ap > best["ap"]:
                best = {"ap": val_ap, "epoch": epoch, "w": w.copy(), "b": b}

    return best, pd.DataFrame(history)


# Train once and track validation AP (early stopping)
best_run, hist = fit_logreg_gd(
    X_train_s, y_train, X_val_s, y_val, lr=0.1, l2=0.1, n_epochs=800, eval_every=10
)

best_run
{'ap': 0.32685299468607276,
 'epoch': 1,
 'w': array([0.    , 0.0044, 0.0011, 0.    , 0.0008, 0.005 ]),
 'b': -0.04463333333333334}
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=hist["epoch"],
        y=hist["train_loss"],
        mode="lines",
        name="train log-loss",
        yaxis="y1",
    )
)
fig.add_trace(
    go.Scatter(
        x=hist["epoch"],
        y=hist["val_ap"],
        mode="lines",
        name="val AP",
        yaxis="y2",
    )
)

fig.add_vline(
    x=best_run["epoch"],
    line_dash="dash",
    line_color="crimson",
    annotation_text=f"best val AP @ epoch {best_run['epoch']}",
)

fig.update_layout(
    title="Training curve: optimize log-loss, select by validation AP",
    xaxis_title="Epoch",
    yaxis=dict(title="Train log-loss"),
    yaxis2=dict(title="Validation AP", overlaying="y", side="right", range=[0, 1.0]),
)
fig.show()

test_ap = average_precision_score_np(
    y_test, predict_proba_logreg(X_test_s, best_run["w"], best_run["b"])
)
test_ap
0.3175857876688931
# Use validation AP to choose a hyperparameter (L2 regularization strength)
l2_grid = np.logspace(-4, 0, 7)
results = []
best_overall = None

for l2 in l2_grid:
    best, _hist = fit_logreg_gd(
        X_train_s, y_train, X_val_s, y_val, lr=0.1, l2=float(l2), n_epochs=800, eval_every=20
    )
    results.append({"l2": float(l2), "best_val_ap": best["ap"], "best_epoch": best["epoch"]})
    if best_overall is None or best["ap"] > best_overall["best_val_ap"]:
        best_overall = {"l2": float(l2), "best_val_ap": best["ap"], "best": best}

df_grid = pd.DataFrame(results).sort_values("l2")
df_grid
         l2  best_val_ap  best_epoch
0  0.000100     0.326853           1
1  0.000464     0.326853           1
2  0.002154     0.326853           1
3  0.010000     0.326853           1
4  0.046416     0.326853           1
5  0.215443     0.326853           1
6  1.000000     0.327212         760
fig = px.line(
    df_grid,
    x="l2",
    y="best_val_ap",
    markers=True,
    log_x=True,
    title="Validation AP vs L2 strength (pick the maximum)",
)
fig.update_layout(xaxis_title="L2 strength (lambda)", yaxis_title="Best validation AP")
fig.show()

best_l2 = best_overall["l2"]
w_star = best_overall["best"]["w"]
b_star = best_overall["best"]["b"]

test_ap_star = average_precision_score_np(y_test, predict_proba_logreg(X_test_s, w_star, b_star))
best_l2, best_overall["best_val_ap"], test_ap_star
(1.0, 0.3272119853443016, 0.3165096666872798)

7) Pros, cons, and common pitfalls#

Pros#

  • Good for imbalanced data: focuses on the positive class; baseline is the prevalence.

  • Threshold-free summary: evaluates the whole ranking, not just one operating point.

  • Ranking metric: invariant to any strictly increasing transform of scores.

Cons / pitfalls#

  • Depends on prevalence: AP values aren’t directly comparable across datasets with different base rates.

  • High variance with few positives: a handful of positives can swing AP a lot.

  • Not symmetric: swapping the positive/negative label changes AP.

  • Not differentiable in a simple way: optimizing AP directly usually requires surrogate losses (pairwise ranking, smooth approximations, etc.).

  • Not an operating-point metric: if you care about a specific constraint (e.g., precision ≥ 0.9), also inspect the PR curve or use precision@k / recall@k.

  • Integration convention: AP (step function) differs from trapezoidal AUPRC; don’t mix them silently.
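For the operating-point caveat, here is a sketch of picking the threshold that maximizes recall subject to precision ≥ 0.9 (the target value and the synthetic data are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.1).astype(int)
s = 2.0 * y + rng.normal(size=2000)  # synthetic scores with signal

prec, rec, thr = precision_recall_curve(y, s)  # prec/rec have len(thr) + 1 entries
ok = prec[:-1] >= 0.9                          # drop the final (recall=0, precision=1) point

if ok.any():
    # Among feasible thresholds, take the one with the highest recall
    best = int(np.argmax(np.where(ok, rec[:-1], -1.0)))
    print(f"threshold={thr[best]:.3f}  precision={prec[best]:.3f}  recall={rec[best]:.3f}")
else:
    print("no threshold reaches precision >= 0.9")
```

Two models with identical AP can differ substantially at a fixed operating point like this, which is why the constraint should be checked directly.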

Exercises#

  1. Show numerically (with random data) that the step-area formula \(\sum_n (R_n - R_{n-1})P_n\) equals the “mean precision at true-positive ranks” formula.

  2. Compare ROC AUC vs AP on a heavily imbalanced dataset. Create two models with similar ROC AUC but very different AP.

  3. Multilabel: generate labels with different prevalences per label and compare average="macro" vs average="micro".

References#

  • scikit-learn API docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html

  • scikit-learn PR curve: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html